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ABSTRACT 

1  * *•)  t  A  o  /  s 

-We  consider  large.  sparse  linear  systems  which  result  from  the  discretization  of  par- 
tial  differential  equations  on  regular  and  irregular  domains,  and  we  focus- on  the  application 
of  the  preconditioned  conjugate  gradient  (PCCG)  method  to  the  solution  of  such  systems. 
More  specifically,  the  goal  of  this  paper  is  the  efficient  implementation  of  the  PCCG  method 
on  vector  supercomputers.  The  contribution  to  the  above  goal  is  made  by  1)  the  introduc¬ 
tion  of  a  data  structure  which  may  be  effectively  manipulated  on  vector  machines,  2)  the 
utilization  of  preconditioning  matrices  which  are  obtained  by  incomplete  factorization  with 
diagonal  update  sets,  and  3)  the  introduction  of  new  numbering  schemes  for  both  regular 
and  irregular  grids. 


•)  This  work  is.  in  part,  supported  under  ONR  contract  N00014-85-K-0339  and  Air  force  contract 
AFOSR-84-0131 .  The  experimental  results  were  obtained  on  the  CRAY  X-MP  of  the  Pittsburgh  Super¬ 
computing  Center. 

*•)  On  leave  from  the  Department  of  Computer  Science,  Purdue  University.  West  Lafayette.  IN  47907 


1.  INTRODUCTION 


Consider  the  linear  system 

A  x  =  b  (1) 


where  /I  is  a  large  n  Xn  symmetric,  positive  definite  matrix.  It  is  widely  accepted  that  the 
conjugate  gradient  method  (CG)  may  compete  with  direct  methods  for  the  solution  of  (1) 
only  if  the  matrix  A  is  suitably  preconditioned.  In  other  words,  in  order  to  speed  up  the 
convergence  of  the  CG  iteration,  it  is  essential  to  find  a  preconditioning  matrix  M  =  A  .  and 
then  apply  the  CG  method  to  the  solution  of  M~l  A  x  =  M-1  b  .  The  resulting  precondi¬ 
tioned  CG  method  (PCCG)  is  described  as  follows: 


ALG1  -  PCCG: 

Chose  an  initial  guess  x0. 

r0  =  b  —  Ax  0  ;  h0  —  p0  —  M~xr0. 

Repeat  for  i  =0.  ■  •  ■  until  convergence. 

<ri  hi  > 


1)  a,  = 


<Ap,  ./>,  > 


2)  *,+i  =  Xj  +  a,  pi  : 


r,  +i  =  r,  -  a,  A  p, 


t 


3)  hj+i  —  M~l  ri+1 


4)0, 


<r,  ,h,  > 


5)  P,  +i  =  fii+i  +  Pi 

Where.  <x  ,y  >  denotes  the  inner  product  of  x  and  y .  If  the  preconditioning  matrix  is 
chosen  such  that  M  -UT  DU ,  for  some  upper  triangular  matrix  U  and  some  diagonal 
matrix  D  .  then  it  is  possible  to  reduce  the  linear  system  A/fij+1=r, +1  in  step  3  of  ALGl  into 
two  triangular  systems.  The  solution  of  these  systems  is  the  price  which  is  to  be  paid  for 
speeding  up  the  convergence  of  the  basic  CG  method  (for  which  M  —l ). 


Unfortunately,  this  price  may  be  quite  high  when  the  PCCG  is  implemented  on  vector 
supercomputers.  More  specifically,  the  solution  of  a  triangular  system  is  a  recursive  pro- 


cess  which  may  be  very  inefficient  on  vector  and  pipelined  computers.  In  order  to  overcome 
this  difficulty,  alternative  preconditioners  have  been  considered  (see  e.g.  [l.  3.  16.  20,  23]). 
Also,  multicoloring  techniques  have  been  used  to  reorder  the  rows  and  columns  of  the 
matrix  such  that  to  minimize  the  data  dependencies,  and  thus  minimize  the  effect  of  recur¬ 
sion  (see  e.g.  [9.17.21,22]).  However,  experimental  results  [9.17.22]  show  that  mul¬ 
ticolor  orderings  decrease  the  rate  of  convergence  of  the  PCCG  method.  These  results  raise 
the  following  challenge  which  was  suggested  by  David  Young  and  al  [25]:  “Can  we  find  an 
iterative  algorithm  which  is  substantially  better  on  supercomputers  than  the  basic  CG 
method?". 

Numerical  solution  techniques  of  partial  differential  equations  (PDE)  are  major 
sources  of  large  linear  systems.  In  any  of  these  systems,  the  coefficient  matrix  A  is  gen¬ 
erated  by  the  discretization  of  a  PDE  on  a  finite  grid  G  such  that  each  row  of  A 
corresponds  to  a  node  in  G  .  The  specific  discretization  used  defines  a  neighboring  relation 
between  the  nodes  of  G  .  For  example,  if  finite  element  analysis  is  used,  then  two  nodes  are 
neighbors  if  they  belong  to  the  same  element.  This  neighboring  relation  determines  the 
sparsity  structure  of  A  .  More  specifically,  an  element  j  of  A  is  non-zero  only  if  nodes  i 
and  j  are  neighbors  in  G  . 

For  rectangular  domains  covered  by  regular  grids,  the  non-zero  elements  in  A  are 
confined  to  few  diagonals.  In  this  case,  good  preconditioning  matrices  may  be  obtained  by 
the  incomplete  Cholesky  factorization  of  A  [7.  11.  12].  Namely  a  Cholesky  factorization  in 
which  only  selected  elements  of  A  are  modified  during  the  factorization  process.  The  posi¬ 
tions  of  these  selected  elements  are  specified  by  an  update  set  P ,  and  the  corresponding  fac¬ 
torization  is  denoted  by  ICiP).  For  example,  if  P  =  Pa  =  {(t  ,j  ):a,  y  ^  0).  the  set  of  posi¬ 
tions  which  contain  non-zeroes  in  A  .  then  the  corresponding  Cholesky  factorization, 
1C  ( Pa  ).  does  not  allow  any  fill-ins  in  me  matrix  A  .  The  conjugate  gradient  method  which 
uses  a  preconditioned  matrix  obtained  by  an  incomplete  Cholesky  factorization  with  update 
set  P  is  denoted  by  1C  {P  )/PCCG  . 
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If,  on  the  other  hand,  the  matrix  A  is  generated  from  irregular  grids,  then,  its  non¬ 
zero  elements  are  not  confined  to  any  regular  pattern.  In  this  case,  a  suitable  precondition¬ 
ing  of  (1)  may  be  obtained  by  chosing  M  to  be  the  SSOR  matrix  [8], 

In  order  to  obtain  efficient  implementations  of  the  1C  ( P  )/PCCG  methods  on  super¬ 
computers.  advances  have  to  be  made  in  three  fronts.  1)  suitable  data  structures  which  are 
efficiently  manipulated  on  vector  computers  have  to  be  chosen,  2)  better  preconditioning 
matrices  have  to  be  used,  3)  renumbering  schemes  have  to  be  found  which  satisfy  a  balance 
between  the  advantages  of  increasing  the  recursion  span  in  step  3.3  of  ALG1  and  the  disad¬ 
vantages  of  slowing  down  the  convergence  of  the  PCCG  method. 

In  this  paper,  we  consider  each  of  the  above  three  fronts.  More  specifically,  we 
present  in  Section  2  a  data  structure  which  may  be  used  efficiently  for  the  manipulation  of 
general  sparse  matrices  on  vector  computers.  Then  in  Section  3.  we  examine  the  class  of 
preconditioning  matrices  obtained  by  incomplete  Cholesky  factorizations.  By  using  the 
notion  of  update  sets  rather  than  the  fill-in  sets,  we  show  that  the  SSOR  matrices  are 
incomplete  factorization  matrices  with  an  empty  update  set.  Also,  the  zero  extension  fac¬ 
torizations  ICCG  (0)  [13]  and  M1CCG  (0)  [7]  are  special  instances  of  an  incomplete  factori¬ 
zation  IC(Pd  )  in  which  the  update  set  is  given  by  Pd  =  {(t  ,t  )  :  i  =1 . n  ).  This  lC(Pd  ) 

factorization  is,  in  fact,  general,  and  may  be  applied  to  matrices  which  are  generated  from 
irregular  grids. 

In  Section  4.1.  column-wise  multicolor  schemes  are  introduced  for  rectangular  grids, 
and  in  Section  4.2.  numbering  schemes  for  pierced  rectangular  grids  are  described  and  their 
effect  on  the  vectorization  potential  of  the  PCCG  method  is  analyzed.  In  Section  4.3.  the 
multicolor  numbering  schemes  are  generalized  and  applied  to  irregular  grids,  and  finally,  in 
Section  5.  we  present  some  experimental  results  which  have  been  obtained  on  the  CRAY  X- 
MP  vector  computer. 
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2.  THE  STRIPE  DATA  STRUCTURE 

In  [14],  stripe  data  structures  are  introduced  as  a  means  to  include  all  the  non-zero 
elements  of  a  matrix  in  a  structure  which  is  suitable  for  parallel  processing  on  linear  com¬ 
putational  arrays.  In  this  section,  we  will  demonstrate  that  this  same  structure  may  be 
used  effectively  to  process  sparse  matrices  on  pipelined  and  vector  computers. 

Very  briefly,  a  stripe  in  an  nXn  matrix  A  is  a  set  of  positions  S  - 
{(t  ,<r(t ))  ;  i  £j  Qln  }.  where  ln  ={l....,n  }  and  cr  is  a  strictly  increasing  function,  that  is  if 
(t  ,<r(i ))  and  ( j  X r(  /  ))  are  in  S  ,  then 

i  <  j  ==>  cr(t  )<tr(/')  (1) 

Note  that  a  stripe  need  not  include  a  position  for  every  row  of  the  matrix.  If,  however,  it 
does  contain  a  position  for  every  row,  then  it  is  called  a  complete  stripe.  Two  stripes.. 
S i=((t  .(TjCi  ))}  and  S2={(‘ .02(1' ))}  are  ordered  by  S\  <  S2  if.  for  any  (i  .tTjCi  ))eS  1  and 

(j  .<r2(.j  ))eS2  we  have 

i  <  j  =->  o" iCi )  <  tr2(j )  (2) 

With  this,  a  stripe  structure  of  A  is  defined  as  a  set  of  stripes,  =  {S  — w,.  -SV2). 

such  that  1)  S_ff,  <  •  •  •  <  S„2  and  2)  the  position  (i  .j  )  of  any  non-zero  element  a,  j  of  A 
is  in  some  stripe  S*  ,  —v1  ^k  <7 r2.  The  number  of  stripes  7r=7r1+7r2+l  is  called  the  stripe 
count.  In  [14],  an  algorithm  is  given  for  the  construction  of  stripe  structures  for  sparse 
matrices,  in  general,  and  for  finite  element  matrices,  in  particular  [15]. 

The  elements  of  a  striped  matrix  A  may  be  stored  in  an  n  xtt  array.  AE ,  and  the 
position  of  each  element  may  be  specified  in  another  n  xtr  array.  AP .  More  specifically,  we 


may  set 


a'  ■<**(' ) 

AE  (i  ,k  +offset  )  =  q 


if  (i  ,<rt  ( i  ))  6  Sk 
otherwise 


ak  (i ) 

AP(i  ,k  +offset )  =  n+J 


if  (i  ,(T i  (i ))  €  S4 
otherwise 


rests 


s‘u  0  0  i.,'«u  0  0  0  0  0 

0  ^au  0  0  0  0  0 

0  0  jj  0  0\  0  (T^a,.  0  0 

'a4,  0  0  "•««.„  0  xfl4|t  0  0^a4.,  0 

0  0  ^ aSJ  0^aS7  0  d\  0 

°X°  0  "“t*  0 

0  Ov  0  0  ^-<1,3  0  0  0\  ON 

0  o'N,..,  0  o\o  o'^a,.  oN«iM 

0  0  0  ^«,.4  0  \tM  0  0^«„  ON 


0  0  a  i.i  a  i.4  a  x j 

0  0  a  i.i  0 

0  0  a14  0  a,j 

®  a4j  a44  a4t  a4, 

ai.i  au  au  ai,7  0 

0  a  6.4  a  6.4  a  6.7  a4.io 
0  a  a  7.7  0  0 

a»4  0  a»J  a».io  0 
a 7.4  a 44  a,,  0  0 

al«  aMJ  aIO,»0  0 


11  11  1  4  5 
11  11  2  5  11 
11  11  3  11  S 
11  1  4  6  9 

I  2  3  7  11 

II  4  6  9  10 
11  5  7  11  11 

3  11  8  10  11 

4  6  9  11  11 
6  8  10  11  11 


Ca)  The  matrix 


(b)  AE 


Cc  )AP 


Fig  1  -  A  stripe  structure  of  a  sparse  matrix 

Here,  offset  =17!+!,  and  AP(i  Jc  +offset  )=n  +1  is  used  to  indicate  that  Sk  does  not  contain  a 
position  in  row  i .  In  Figure  1.  we  give  an  example  which  illustrates  the  concept  of  stripe 
data  structures.  Clearly,  this  stripe  storage  scheme  is  a  generalization  of  the  diagonal 
scheme  used  in  [25]  and  [10].  It  is  also  a  more  restricted  form  of  the  data  structure  used  in 
[19],  in  which  neither  the  strictly  increasing  property  (l).  nor  the  non-intersecting  property 
(2)  holds.  The  former  property  is  crucial  for  the  efficient  manipulation  of  symmetric 
matrices  on  vector  computers,  while  the  latter  is  only  useful  for  the  manipulation  of  sparse 
matrices  on  linear  computational  networks  [14]. 

If  A  is  a  symmetric  matrix  with  non-zero  diagonal  elements,  then  it  is  possible  to  con¬ 
struct  a  stripe  structure  ZA  —  {S^ .  •  •  •  -STj ).  such  that  the  lower  stripes.  5_t  ,  k  =1 . tt,  . 

are  mirror  images  of  the  upper  stripes,  Sk  .  k  =1 . tt5  .  In  this  case,  only  the  elements  in 

S0-  •  •  ■  -S„s  need  to  be  stored  in  AE  and  AP  (with  offset  =1  in  equations  (3)). 

Given  a  symmetric  matrix  A  in  stripe  storage  form,  the  computation  of  the  product 
vector  y  =Ax  .  for  any  vector  x  .  may  be  accomplished  by  the  multiplication  of  each  ele¬ 
ment  a,  ,crk(i)  in  an  upper  stripe  Sk  with  both  x ,  and  x and  then  the  accumulation  of 
the  results  in  y<,ku)  and  y, .  respectively.  Using  a  pseudo  CRAY-fortran  language,  the 
matrix/vector  multiplication  algorithm  may  be  written  as  follows: 

ALG2  :  Symmetric  Matrix/ Vector  Multiplication 

1)  x  (n  +1)  =  0 


2)  y(ij  —  AE ( i  .offset  )  *  x  (i  ) 

3)  DO  k  =1,  •  •  •  ,ns 


i  =1  ....ji 


3.1)  call  gather(n  ,w  jc.  APilJc  +offset )) 

3.2)  y  (t  )  =  y  (i  )  +  AE  (i  Jc  Aoffset  )  *  w  (t  )  i  =1 . n 

3.3)  z  (i  )  =  AE (i  .£  +offset  )  *  x  (i )  i  =  l,....n 

3.4)  w  (»  )  =  0  i  =l.....n 

3.5)  call  scatter(n.  ,w  AP(\Jt  +offset  ).z  ) 

3.6)  y  (t  )  =  y  (t  )  +  w  (i  )  i  =1 . n 

Step  2  in  ALG2  accumulates  in  y  the  contribution  of  S0.  steps  3.1  and  3.2  accumulate 
the  contribution  of  Sk  .  and  steps  3. 3-3.6  accumulate  the  contribution  of  S-k .  The  strictly 
increasing  property  (l)  is  essential  for  the  correctness  of  the  above  algorithm  because  it 
ensures  that  if  i#/,  then  crk  (i  crk  ( j ).  If  this  is  not  satisfied,  then  z((Tk(i))  and 
z  (or*  ( j  ))  will  be  scattered  to  the  same  location  in  w .  and  thus  only  one  of  them  will  be 
accumulated  in  y . 

Besides  matrix/vector  multiplication,  the  stripe  structure  may  be  used  in  the  solution 
of  triangular  linear  systems.  For  example,  consider  the  upper  triangular  system 
{D  +U  )x  =b .  where  D  is  a  diagonal  matrix,  and  U  is  a  strictly  upper  triangular  matrix 
with  a  stripe  structure  Zj;  =  {Sj,  •  •  •  The  solution  of  this  system  is  a  recursive  pro¬ 
cess  in  which  the  calculation  of  x,  proceeds  in  the  order  i  =n . 1,  with  x,  depending  on  the 

previously  calculated  values  Xj  .  j  >i .  The  minimum  recursion  span,  that  is  the  minimum 
integer  d  such  that  x,  does  not  depend  on  x1+1  •  •  •  jcj+d  .  is  equal  to  the  upper  zero  stretch 
of  the  matrix  defined  as  follows: 

Definition:  The  upper  zero  stretch  of  a  matrix  A  is  the  largest  integer  A2  such  that  a,  :  =0 

for  i  =1 . n  and  i  <  j  <i  +A2-  1°  other  words,  A2  is  the  size  of  the  maximum  band  above 

(and  including)  the  main  diagonal  which  contains  zero  elements. 

For  the  upper  triangular  matrix  U .  the  upper  zero  stretch  is  found,  by  properties  (1)  and 
(2),  to  be 


A2  =  max{  CTjCi  )— i  ;  (i  .cr^t ))  e  S  i  } 


Given  Ly  and  A2,  the  triangular  system  ( D  +U  )x  =b  may  be  solved  using  operations 
on  vectors  of  length  A2.  Similarly,  given  a  strictly  lower  triangular  matrix  L  and  a  stripe 
structure  ZL  =  {S.^,  •  •  •  a  lower  triangular  system  of  the  form  ( L  +Z)  )x  =&  may  be 
solved  using  operations  on  vectors  of  length  Aj,  where  Aj  is  the  lower  zero  stretch  of  L 
given  by 

Aj  =  max{  i  —tr^ii  )  ;  (i  ,cr_i(i  ))  e  S_!  ) 

When  the  IC/PCCG  method  is  applied  to  symmetric  matrices,  it  is  necessary  to  solve 
in  each  iteration  two  triangular  systems  of  the  forms  (L  +D  )x  =b  and  ( D  +U  )y  -x  ,  with 
L  =UT .  If  X^HSi.  •  •  is  a  stripe  structure  for  U ,  then  it  is  possible  to  construct  a 

stripe  structure  Ll  =15-^  .  •  •  •  -S_i)  for  L  such  that  each  stripe  S_t ,  1  < k  $ v5  is  the  mir¬ 
ror  image  of  5*  .  In  this  case,  only  U  needs  to  be  stored.  Let  UE  and  UP  be  the  arrays  used 
to  store  U  in  a  way  analogous  to  equations  (3)  (with  offset  =0).  and  let  A,  be  the  lower 
(upper)  zero  stretch  of  L  W ). 

The  solution  of  (Z,  +D  )x  -b  is  affected  by  the  absence  of  an  explicit  storage  for  the 
elements  of  L.  More  specifically,  consider  the  usual  forward  substitution  algorithm.  In 
this  algorithm,  the  solution  proceeds  such  that,  after  the  computation  of  xlt  •  •  , xr  Aj,  for 
some  r  .  the  next  A*  elements  of  x  are  computed  by.  first,  computing  the  A,  component  z, . 
i-r  A,  +1.  •  •  •  ,(r  +l)Ar ,  of  the  vector  z  =b  —Lx  ,  and  then  dividing  each  z,  by  d, .  the  ith 
component  of  D  .  However,  the  above  scheme  is  not  efficient  because  it 's  difficult  to  access, 
in  UE ,  the  elements  of  T.L  which  belong  to  rows  r  A,+l.  ■  •  ■  ,(r  +1)AS .  In  order  to  over¬ 
come  this  difficulty,  we  use  a  column  sweep  algorithm  ( ji  algorithm  according  to  [5]). 
More  specifically,  we  let  LJ  be  the  j,h  column  of  L  and  we  compute  the  vector 

n  (r  +11A, 

z  —b  —Lx  -b  —^L,  x,  progressively  by  accumulating  £  L,  x,  into  z  as  soon  as  the 

/=1  l=rA,*l 

appropriate  values  of  x  are  computed.  This  requires  the  access  of  the  elements  of  1/  which 
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belong  to  columns  r  A,  +1,  •  ■  •  ,(r  +  1)A, ,  that  is  the  elements  of  Zy  which  belong  to  rows 
r  Aj  +1.  •  •  •  ,(r  +l)At .  These  elements  may  be  easily  accessed  in  UE . 

More  precisely,  the  solution  of  {L  +D  )x  =b  may  be  described  by  the  following  algo¬ 
rithm.  in  which  we  assume,  for  simplicity,  that  n  =AS  m  for  some  integer  m  : 


ALG3 :  Forward  Substitution 


1) z(i  )  =  b(i) 

2)  x  (i  )  =  z  (t  )  /  d  (i  ) 

3)  FOR  r  =0 _ m  —1  DO 

3.0)  Let  ij  =rA,  +1  and  i,  =  (r  +  l)Aj 

3.1)  FOR  k  =1 t r,  DO 

3.1.1  )w(i)  =  UE(iJc)*  x(i) 

3.1.2)  u  (i  )  =  0 

3.1.3)  call  scatter(Aj  .u  ,UP(is  ,k  ),w  (is  )) 

3.1.4)  z(i)  =  z(i)  —  u  (i ) 

3.2)  x(i)  =  z(i)  /  d(i) 


i  =  1 . n 


i  =1 . A. 


i=is.  ■  ■  .1, 


l  =ls  .  ■  ■  ■ 


i  -i, .  •  .n 


i=it+ 1.  ■  ,i,  +Aj 


Note  that  in  3.1.3.  the  A,  elements  of  w  may  be  scattered  anywhere  in  u.  For  this 
reason  the  summation  in  3.1.4  runs  up  to  i  —n  .  An  alternative  way  for  writing  the  inner 
loop  in  step  3.1  is  as  follows 


3.1.1)  call  gather(Aj  ,u  ( i s  ).z(is  ').UP(is  ,k  )) 

3.1.2)  u  (i )  =  u  (i  )  —  UE  ( i  .k)*  x(i) 

3.1.3)  call  scatter(Aj  ,z  ( is  ).UP(is  ,k  ).u  (is )) 


‘  . lr 


Clearly,  this  second  alternative  reduces  the  number  of  operations.  Moreover,  the  subtrac¬ 
tion  and  the  multiplication  are  performed  in  the  same  step  which  allows  for  the  chaining  of 
the  two  operations.  The  choice  between  the  two  alternatives  should  depend  on  the  relative 
execution  time  of  the  different  vector  instructions  on  the  specific  computer  used.  For  exam¬ 
ple.  actual  measurements  on  the  CRAY  X-MP  shows  that  the  slow  down  due  to  the  addi¬ 
tional  call  to  "gather"  in  the  second  alternative  more  than  offsets  the  gain  obtained  from  the 
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chaining  of  the  operations  and  the  lower  operation  count. 

3.  INCOMPLETE  FACTORIZATIONS  WITH  DIAGONAL  UPDATE  SETS 

Given  a  symmetric  positive  definite  matrix  A  of  order  n  ,  the  incomplete  factorization 
of  A  is  a  splitting  of  the  form 

A  =  M  +  R  (4. a) 

where 

M  =  U+UY  D  (/+£/)  (4.b) 

and  1  is  the  identity  matrix.  D  is  a  diagonal  matrix  and  U  is  a  strictly  upper  triangular 
matrix.  The  matrices  D  and  U  are  determined  from  A  by  applying  a  factorization  pro¬ 
cedure  in  which  only  specific  elements  of  A  may  be  updated  during  the  factorization.  More 
specifically,  if  P={(i  ,j  )  :  i  .je[\.n  ]}  is  any  subset  of  positions  in  A  ,  then  the  corresponding 
incomplete  factors  of  A  are  obtained  as  follows: 

ALG4  :  Incomplete  Cholesky  Factorization 
FOR  i  =  1 ,  •  ,n  DO 

^ i , i  “  flj  ,j 

FOR  j  =i  + 1 .  •  ■  ■  ,n  DO 

J  ,j  /  d i  j 

FOR  /  -i  + 1 .  •  •  •  .n  DO 
FOR  k=j  .■  ■  ■  .n  DO 

IF  (j  ,k)  €  P  THEN  dj  M  =a,,  -  u,  J  *  ulX  *  d, 

We  call  P  the  update  set  and  we  call  the  corresponding  factorization  (4.b)  and  split¬ 
ting  (4. a)  the  IC{P )  factorization  and  splitting  of  A  ,  respectively.  A  possible  choice  of  P 
is  the  set  Pa  =  ((t  ,j  ):  a,  ;  ^  0}  of  positions  which  contain  non-zero  elements  in  A  . 

Manteuffel  [ll]  suggested  a  shifted  incomplete  Cholesky  splitting  (S/C  )  in  which  the 
off  diagonal  elements  of  of  A  are  scaled  by  some  factor  w  before  the  factorization.  In  order 
to  be  more  specific,  let  F  and  A  be  the  upper  triangular  and  the  diagonal  parts  of  A  , 


$ 


respectively.  That  is 

A  =FT  +  A  +  F.  (5) 

With  this,  consider  the  shifted  matrix 
A  =  A  +  cj(Ft  +  F) 

and  let  U  and  D  be  the  factors  produced  by  the  application  of  ALG4  to  A  .  The  SIC  ( P  ,o>) 
splitting  of  A  .  corresponding  to  the  update  set  P  and  the  shift  factor  a>.  is  then  given  by 

A  =(/  +U)r  D  (/  +£/)  +  R 

Note  that  SIC  (P  ,co)  reduces  to  IC(P )  for  w=l. 

Factorizations  with  update  sets  larger  than,  or  equal  to,  Pa  have  been  considered  in  the 
literature  (e.g.  [13])  for  matrices  which  are  generated  from  rectangular  grids.  For  these 
matrices,  ALG4  may  be  greatly  simplified.  However,  for  matrices  which  are  generated 
from  irregular  grids,  the  choice  of  P~2.Pa  does  not  seem  useful  because,  even  though  the 
execution  of  ALG4  is  rather  costly,  only  few  of  the  terms  u,  j  *  ulJt  in  the  inner  loop  are 
non-zeroes,  thus  causing  only  few  updates  in  the  entries  of  A  .  In  this  case,  the  SSOR  split¬ 
ting  of  A  may  be  used. 

SSOR  splitting  as  incomplete  factorization. 

If  the  update  set  P  is  taken  to  be  the  empty  set  4>,  then  it  is  easy  to  see  that  the  fac¬ 
tors  U q  and  D  $  resulting  from  ALG4  are  given  by 

D+  =  A 
U+  =  A"1  F 

Also,  if  the  off-diagonal  elements  of  A  are  shifted  by  oj  before  factorization,  then  the 
resulting  factors  are  given  by 

D$  =  A 
0<t>  =  a>  A"1  F 


Hence,  the  SIC  (<J>xu)  preconditioning  matrix  is 


(6. a) 
(6.b) 
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Af0  =  (7  +wA-1fF  A  (7  +(jA-'f)  (7) 

=  (A  +  0)  F  Y  A"1  (A  +  o)  F ) 

Which  is  equal  to  the  SSOR  preconditioning  matrix,  up  to  a  scalar  factor  of  w(2— w).  This 
scalar  factor  does  not  have  any  effect  from  the  point  of  view  of  the  PCCG  method. 

Diagonal  update  Cholesky  factorization. 

Consider  non-shifted  incomplete  factorization.  We  may  envision  a  linear  scale  on 
which  the  preconditioners  obtained  from  ALG4  are  laid  according  to  the  size  of  the  update 
set  P  (see  Fig  2).  Clearly,  the  exact  factorization  lC{Pt )  corresponding  to 

P,  ®K*  •/  )  :  * -j  =1 and  the  SSOR  splitting  7C  (P^)  are  laid  at  the  two  ends  of  the 

scale,  with  IC  ( Pa  )  somewhere  in  between.  The  results  of  Manteuffel  [l  1  ]  for  M  matrices 
show  that,  if  Px  £  P2.  then  the  IC(P  {//PCCG  may  not  outperform  IC  (P2)/PCCG  .  In 
other  words,  the  performance  of  the  IC  /PCCG  improves  when  we  move  to  the  right  on  the 
scale  of  Fig  2. 


«D 

Pd 

Pa 

P, 

size  of  the 
_ ^  undate  set 

SSOR 

diagonal 

update 

standard 

IC 

exact 

Cholesky 

Fig  2  -  The  spectrum  of  incomplete  Cholesky  factorizations 

The  diagonal  update  factorization  is  the  factorization  obtained  from  ALG4  with  the 

update  set  given  by  Pd  -  {(i  ,i )  ;  i  =1 . n  (.  That  is,  only  the  diagonals  of  A  are  updated 

in  ALG4.  If  we  shift  the  matrix  A  by  o>  before  applying  ALG4,  then  it  is  easy  to  see  that 
the  factors  Dd  and  Ud  resulting  from  the  factorization  are  given  by 
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Dd  =  A 

Ud  =w  (A)"1  F 


where,  by  ALG4.  the  elements  A,  ,  i  =1 . n  of  the  diagonal  matrix  A  are  given  by 


i-i 


l  ^  M J  , l  /  ^y 
1=  1 


t  =1 . n 
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The  equations  which  express  the  solution  of  upper  triangular  linear  systems  are  simi¬ 
lar  to  the  above  equations,  with  the  division  replaced  by  a  multiplication.  Assuming  that 
the  execution  of  a  vector  divide  operation  is  at  most  four  times  slower  than  a  vector  multi¬ 
ply  (on  the  CRAY  the  ratio  is  found  to  be  2.5  for  vectors  of  length  6400),  then  it  is  clear 
that  the  cost  of  computing  A  is  at  most  equal  to  the  cost  of  one  PCCG  iteration.  Given  A. 
the  SIC  (Pd  .(d)  preconditioning  matrix  is 

Md  =  (I  +  to  A-1  F)T  A  (/  +  w  A  F)  (8) 

In  addition  to  being  easy  to  compute,  both  SSOR  and  SIC  ( Pd  .to)  preconditioners  have 
an  advantage  over  factorizations  with  update  sets  larger  than  Pd  .  Namely,  they  may  lead 
to  reduced  computational  work  in  each  PCCG  iteration.  More  specifically,  each  iteration  in 
ALG1  requires,  in  addition  toO(n)  operations,  both  the  matrix/vector  multiplication  Ap . 
and  the  solution  of  two  triangular  systems  in  step  3.  However,  for  SSOR  and  SIC(Pd  .(d) 
preconditioners.  ALGl  may  be  rewritten  such  that  it  requires  only  the  solution  of  two  tri¬ 
angular  systems,  thus  reducing  the  work  per  iteration  by  a  factor  of.  almost  2.  For  exam¬ 
ple.  if  the  SSOR  matrix  (7)  is  used  in  ALGl.  and  the  following  substitution  is  made 

H  =  (A  +  (o  F)~r  A  (A  +  (d  F)~l.  (9) 


then,  it  is  possible  to  rewrite  the  algorithm  such  that  it  only  involves,  besides  0(n  )  opera¬ 
tions.  the  multiplication  of  a  vector  p  by  H  (see  [6]  for  details).  From  (9)  and  (5),  this 
multiplication  may  be  computed  as  follows: 

H  p  =  (d2  (v  +  (—A  +  F  )~T  (/>  +  -  ^  A  v  )  )  (10) 

(D  (D 


where  v  =  (—A  +  F  )_1  p  ■ 

(O 

two  triangular  systems. 


That  is  the  multiplication  H  p  may  be  performed  by  solving 


The  same  argument  applies  to  the  SIC  ( Pd  ,(d)  matrix  (8).  which  has  the  same  form  as 
(7),  except  that  A  is  replaced  by  A.  In  this  case,  the  multiplication  Hp  may  be  obtained 
from  equation  (10)  with  A  replaced  by  A. 


V.' 


Given  that  Pd  24>,  then  according  to  [ll],  IC(Pd  )/PCCG  is  expected  to  outperform 
SSOR  /PCCG  for  M  matrices.  The  examples  considered  in  Section  5  suggest  that,  in  gen¬ 
eral,  1C  (Pd  )/PCCG  is  at  least  as  efficient  as  the  SSOR  I  PCCG  . 

4.  MULTICOLOR  NUMBERING  SCHEMES 

The  major  difficulty  in  the  vectorization  of  the  IC/PCCG  method  concerns  the  recur¬ 
sive  nature  of  the  solution  of  triangular  linear  systems.  In  order  to  overcome  this 
difficulty,  many  multicolor  numbering  schemes  have  been  suggested  [9,  17,  21,  22]  for  rec¬ 
tangular  grids.  Their  goal  is  to  obtain  a  coefficient  matrix  A  which  may  be  partitioned  into 

submatrices  A,  j  .  i  ,j  =1 . p  .  for  some  p  .  such  that  A{  , .  i  =1 . p  .  are  diagonal  matrices. 

This  goal  is  achieved  by  applying  an  algorithm  which  has  the  following  form: 

ALG5  :  A  Global  Multicolor  Numbering  Scheme 

1)  Assign  to  each  node  in  the  grid  a  color  from  a  set  of  p  colors  such  that  neighbor¬ 
ing  nodes  have  different  colors.  Chose  p  to  be  as  small  as  possible. 

2)  For  each  color  c  =1 . p  ,  number  all  the  nodes  which  have  color  c  in  a  column¬ 

wise  sequential  order. 

The  above  numbering  scheme  is  global  in  the  sense  that  all  the  nodes  which  have  the 
same  color  are  numbered  consecutively.  Although  global  numbering  schemes  produce 

matrices  with  large  zero  stretch  (approximately  — .  where  n  is  the  number  of  nodes  in  the 

P 

grid),  experience  [9.  17.  22]  shows  that  these  schemes  worsen  the  condition  of  the  matrix, 
thus  slowing  down  the  convergence  of  the  IC/PCCG  method.  In  this  section,  we  suggest  a 
numbering  scheme  which  compromises  between  the  zero  stretch  of  the  matrix  and  the  con¬ 
vergence  properties  of  the  PCCG.  We  first  introduce  the  scheme  for  rectangular  grids. 

4.1.  Column- wise  multicolor  numbering  of  rectangular  grids 

Let  T  be  an  upper  (or  a  lower)  triangular  matrix  given  in  striped  form.  It  was  shown 
in  Section  2  that  the  larger  the  zero  stretch  of  T ,  the  more  efficient  is  the  solution  of  Tx  =b 


on  vector  computers.  However,  the  advantage  of  having  large  zero  stretches  when  the 
IC  /PCCG  is  applied  to  the  solution  of  linear  systems  should  be  put  in  its  proper  prospec¬ 
tive.  More  specifically,  it  is  important  to  note  the  following: 

1)  the  zero  stretch  does  not  affect  the  matrix/vector  multiplication,  the  inner  products  com¬ 
putation  or  the  vector  addition  operations  in  ALG1. 

2)  the  solution  of  a  triangular  linear  system  includes  data  movement  operations  (e  g.  scatter 
and  gather  in  ALG3).  These  operations  are  slower  than  vector  arithmetic  operations  on 
existing  vector  computers,  and  hence  they  dominate  the  execution  time  of  the  solution. 

3)  the  advantage  of  having  long  vectors  is  relatively  limited  in  data  movement  operations 
compared  to  arithmetic  operations.  More  specifically,  the  execution  time  of  a  vector  arith¬ 
metic  operation  on  a  pipeline  computer  is  usually  specified  by  Ts  +8rp  .  where  ts  is  a  vector 
setup  time.  rp  is  pipe  unit  time  and  8  is  the  vector  length.  Usually.  Ts  is  much  larger  that 
Tp  .  which  makes  it  very  advantageous  to  use  long  vectors.  For  example,  a  multiply  /add 
vector  operation  on  the  CRAY  X-MP  consumes  70  and  170  fi- sec.  for  8  *  640  and  6400. 
respectively.  On  the  other  hand,  using  long  vector  in  data  movement  operations  is  less- 
advantageous  because  the  pipe  unit  time  for  vector  data  movement  operations  is  relatively 
large,  and  depends  on  the  distribution  of  data  in  memory.  For  example,  a  specific  gather 
operation  on  the  CRAY  X-MP  consumes  40  and  370  n~ sec  for  8  «  640  and  6400.  respec¬ 
tively.  The  corresponding  times  for  a  scatter  operation  are  90  and  410  ^t-sec.,  respectively. 

Given  the  above  facts,  our  main  goal  in  the  vectorization  of  the  IC  /PCCG  method 
should  be  to  increase  the  zero  stretch  in  the  matrices  to  a  point  which  prevents  the  recursive 
solution  of  triangular  systems  from  dominating  the  entire  process.  However,  once  the  exe¬ 
cution  time  of  ALGl  is  not  dominated  by  the  time  for  step  3,  any  further  increase  of  the 
zero  stretches  will  have  a  relatively  limited  effect  on  the  execution  lime  of  the  1C  /PCCG 
iteration.  For  instance,  when  step  3  consumes  only  one  fourth  of  the  execution  time  of 
ALGl.  the  doubling  of  the  speed  of  step  3  speeds  up  the  entire  algorithm  by  less  than  1.15. 


The  above  discussion  suggests  that  a  numbering  scheme  which  produces  "reasonably 
conditioned"  matrices  with  "reasonable"  zero  stretches  may  be  better  than  a  numbering 
scheme  which  produces  "ill  conditioned"  matrices  with  very  large  zero  stretches.  The 
column-wise  multicoloring  scheme  belongs  to  the  former  class.  It  is  described,  for  an  r  Xm 
rectangular  grid,  by  the  following  algorithm: 

ALG6  :  A  Column-Wise  Multicolor  Numbering  Scheme 

1)  Assign  to  each  node  in  the  grid  a  color  from  a  set  of  q  colors  such  that  neighbor¬ 
ing  nodes  that  are  in  the  same  column  have  different  colors.  Chose  q  to  be  as 
small  as  possible. 

2)  FOR  j  =1.  .m  DO 

FOR  each  color  c=l....,^.  number,  sequentially,  the  nodes  in  column  j 
which  have  color  c  . 

It  may  be  shown  that  ALG6  produces  matrices  with  zero  stretches  equal  to  — .  while 

« 1 

ALG5  produces  matrices  with  zero  stretches  The  minimal  values  of  q  and  p  for 

P 

different  discretization  stencils  are  given  in  Table  1,  where  FDS  denotes  the  5-point  finite 
difference  discretization,  and  FE3,  FEA.  FEb  and  FEq  denote  finite  element  discretizations 
with  3-node  triangles.  4-node  rectangles,  6-node  triangles  and  9-node  rectangles,  respec¬ 
tively. 


FDS 

fe3 

FE  4 

FEb 

FEq 

p  for  global  schemes 

2 

3 

4 

6 

9 

q  for  column-wise  schemes 

2 

2 

2 

3 

3 

Table  1  -  minimal  number  of  colors  in  multicoloring  schemes 

4.2.  Numbering  pierced  rectangular  grids 

Pierced  rectangular  grids  (see  Fig  3)  are  defined  in  [  1 5]  to  be  rectangular  grids  from 
which  some  subrectangles  are  removed.  This  type  of  grids  is  useful  because  any  irregular 
domain  may  be  covered  by  a  pierced  rectangular  grid,  or  by  a  grid  which  is  isomorphic  to  a 


pierced  rectangular  grid. 
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Fig  3  -  A  pierced  rectangular  grid  numbered  with  a 
2-color  column-wise  scheme. 

Let  Q  be  a  rectangular  grid  which  contains  nQ  nodes  numbered  by  the  integers  1 . nQ 

according  to  some  numbering  scheme,  and  let  ft  be  a  grid  which  is  obtained  by  piercing  Q 
and  which  contains  n  n  nodes.  A  renumbering  of  the  nodes  which  are  in  ft  may  then  be 
defined  by  a  function  v  that  assigns  to  each  node  i .  which  is  in  both  Q  and  ft.  a  unique 
number  v(i  ).  1  ^v(i  n.  This  renumbering  is  said  to  be  deduced  from  the  numbering  of 
Q  if  v  is  derived  as  follows 

ALG7  :  Numbering  Pierced  Rectangular  Grids 
l  =  0 

FOR  i  -  1 .  •  ••  jiq  DO 

IF  node  i  is  in  ft  THEN  { l  =  l  +  1  ’.  r(i )  =  l  I 
ELSE  j  v(i )  =  T  (undefined)  } 

The  upper  (or  lower)  zero  stretch  of  the  symmetric  matrix  generated  from  the  pierced 
rectangular  grid  ft  is  then  given  by 

A  =  mini  I  v(i  )— v(j  )  I  ;  i  and  j  are  neighboring  nodes  in  ft)  (11) 


The  specific  value  of  A  in  (11)  depends  on  the  shape  of  Cl  and  on  the  numbering 
scheme.  However,  due  to  the  regularity  of  global  and  column-wise  multicolor  numberings, 
it  is  possible,  in  both  schemes,  to  establish  upper  and  lower  bounds  for  A.  In  the  remainder 
of  this  section,  we  will  find  such  bounds  for  the  FE 4  and  FE9  discretizations.  Similar 
bounds  may  be  derived  for  other  discretizations.  We  start  by  deriving  bounds  on  A  for 
column  wise  multicolor  numberings. 

Theorem  1:  If  either  FEA  or  F£9  discretization  is  used,  and  a  column-wise  multicoloring 
scheme  is  applied  with  q  =2  or  3,  respectively,  then 

^  min  ^  <  k  C 

2  ^  A  \  ce  mjn 


where  cemin  is  the  minimum  number  of  elements  in  any  column  of  elements  in  Cl. 

Proof:  see  the  appendix. 

For  global  coloring  schemes,  the  bounds  on  A  may  be  given  in  terms  of  the  total 
number  of  nodes  n  in  fl.  For  this,  we  first  estimate  the  number  of  nodes  in  Cl  which  are 
colored  by  any  particular  color. 

Lemma  1:  Let  the  n  n  nodes  in  a  pierced  rectangular  domain  Cl  be  numbered  by  first  using 
a  global  multicolor  scheme  to  number  the  smallest  rectangular  domain  Q  which  enclose  Cl . 
and  then  deducing  the  node  numbers  in  Cl  using  ALG7.  If  FE  A  or  FE9  discretizations  are 
used  with  p  =4  and  9.  respectively,  then  for  any  particular  l .  ,  the  number.  n‘n  .of 

nodes  given  the  color  l  is  bound  by 


2\fp  —  1 


n  n  < 


'fp  +1 


nn. 


(12) 


Proof:  see  the  appendix 

In  Figure  4.  we  color  some  specific  rectangular  grids  to  show  that  the  bounds  in 
Lemma  1  are  tight  (the  numbers  in  Fig  4  refer  to  colors).  More  specifically,  for  FEA.  the 

grid  in  Figure  4(a)  gives  n  ^  =‘^'^  o  and  n  A  ~~^n  n-  Similarly,  for  FE9.  n  $  a  in  Fig- 


ure  4(b).  and  n  ^  =—n  n  in  Figure  4(c).  Clearly,  the  grids  in  Fig  4  are  rectangular  grids. 

4 


which  are  special  cases  of  pierced  rectangular  grids. 


Figure  4  -  Tightening  the  bounds  of  Lemma  1 


The  following  theorem  uses  Lemma  1  to  bound  the  zero  stretch  of  the  matrix  resulting 
from  the  finite  element  discretization. 

Theorem  2:  Given  the  hypothesis  of  Lemma  1.  the  zero  stretch  of  the  matrix  resulting  from 
the  finite  element  discretization  is  bounded  by 

,  2 

na-  1  A  <  -=£■ —  [«  n  '*■/’]■  (13) 

•Jp  +1 

Proof:  see  the  appendix. 

Given  a  pierced  rectangular  grid  ft .  the  above  bounds  may  be  used  for  the  apriori  esti¬ 
mation  of  the  zero  stretch  which  result  from  specific  numbering  schemes.  The  experiments 
of  Section  5  show  that  the  convergence  rate  of  the  PCCG  method  is  inversely  related  to  the 
zero  stretch.  A.  of  the  matrix,  and  that  beyond  a  specific  value  of  A.  the  reduction  in  the 
execution  time  of  each  PCCG  iteration  does  not  offset  the  increase  in  the  number  of  itera¬ 
tions.  In  this  context,  apriori  estimates  of  A  may  help  in  chosing  the  best  numbering 
scheme  for  ft. 

4.3.  A  multicolor  numbering  f or  irregular  grids 

In  finite  element  analysis,  it  is  often  useful  to  change  the  density  of  the  discretization 
points  among  the  different  regions  of  the  domain.  This  is  especially  advantageous  if  a  rough 
estimate  of  the  solution  is  known  or  if  an  adaptive  solution  technique  is  applied.  In  both 
cases,  it  is  preferable  to  increase  the  density  of  the  discretization  points  in  the  regions  where 


the  solution  is  known  to  (or  is  found  to)  change  rapidly.  This  desirable  variation  in  density 
may  not.  usually,  be  accomplished  without  the  use  of  irregular  grids. 

All  previous  schemes  for  numbering  irregular  grids  aimed  at  the  reduction  of  the 
band-width  and  the  profile  of  the  resulting  matrices.  This  is  particularly  useful  if  direct 
methods  are  used  for  the  solution  of  the  linear  systems  resulting  from  the  grids.  However, 
as  we  explained  in  Section  2,  if  iterative  solution  methods  are  applied  and  vector  computers 
are  used,  then  the  efficiency  of  the  solution  process  is  determined  by  the  zero  stretch  of  the 
matrix.  In  this  section,  we  describe  a  multicoloring  technique  which  aims  at  increasing  the 
zero  stretch  of  the  matrix. 

Let  G  be  the  graph  corresponding  to  a  given  irregular  grid.  That  is.  the  nodes  in  G 
correspond  to  the  nodes  in  the  grid,  and  any  two  neighboring  nodes  in  the  grid  are  connected 
by  an  edge  in  G  .  Guided  by  the  global  multicoloring  scheme  for  rectangular  grids,  our  goal 

is  to  partition  G  into  disjoint  subsets  of  nodes  Gl .  I  =1 . p  .  such  that  l)  p  is  as  small  as 

possible.  2)  no  two  node  in  any  partition  G‘  are  neighbors,  and  3)  the  variation  in  the  size 

of  the  partitions  Gl .  l  =1 . p  is  minimal.  Given  such  a  partitioning,  it  is  possible  to 

number  the  nodes  in  each  subset  consecutively.  However,  this  does  not  guarantee  a  large 
zero  stretch  if  the  numbering  of  the  nodes  within  each  partition  G '  is  done  arbitrarily.  For 
example,  if  the  node  numbered  first  in  Gl  is  connected  to  the  node  numbered  last  in  G/_1. 
then  the  zero  stretch  is  equal  to  unity.  In  other  words,  the  relation  between  the  nodes  in 
G1  and  G/_1  should  be  taken  into  consideration  in  the  numbering  process. 

The  numbering  scheme  which  we  suggest  consists  of  the  following  four  steps: 

1)  The  generation  of  a  level  structure  j  V  l,V 2,  ■  ■  ■  },  such  that  the  nodes  in  each  level  Vu 
are  not  connected  to  the  nodes  in  levels  V„. ,  w  >u  +1  or  w  <u  —  1.  In  other  words, 
nodes  in  Vu  may  only  be  connected  to  nodes  in  Vu~x  and  Vu+1. 

2)  For  each  level  Vu  ,  the  partitioning  of  the  nodes  in  Vu  into  the  minimum  number  of 
independent  sets  V,,1,  V2,  ■  ■  such  that  the  nodes  in  each  VUK  are  independent.  That  is 
no  two  nodes  in  are  connected  by  an  edge  in  G  . 
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3)  The  formation  of  the  partitions  Gl .  and 

4)  The  actual  numbering  of  the  nodes. 

In  order  to  generate  the  level  structure,  we  start  from  an  arbitrary  initial  level  Vj. 
and  we  proceed  in  a  way  very  similar  to  the  Cuthill-Mckee  Algorithm  [4].  More 
specifically,  given  a  level  Vu ,  we  construct  the  next  level  Vu  +1  by  including  in  it  any  node 
which  is  connected  to  a  node  in  Vu  and  which  is  not  in  any  previous  level.  A  criteria  which 
should  guide  the  choice  of  V  t  is  the  desire  to  have  £  1^1=  £  I  ( .  where  I  Vu  I  is 

u  ^tven  u  =odd 

the  number  of  nodes  in  Vu .  For  example,  if  we  denote  by  CNU  the  u'h  column  of  nodes  in 
a  pierced  rectangular  domain,  then  for  FE 4  discretization.  \CN  X.CN  2—-\  is  a  level  structure 
in  which  the  ratio  of  £  \CNU  I  to  £  \CNU  I  may  be  proven  to  be  larger  than  0.5  and 

u  =odd  u  =*ve/j 

smaller  than  2  (see  proof  of  Lemma  1  in  the  appendix). 

Given  a  specific  level.  Vu .  the  optimal  generation  of  the  independent  sets  Vul.Vu2,  ■  ■  ■ 
is  known  to  be  NP  complete.  However,  a  simple  algorithm  which  may  be  used  for  this  gen¬ 
eration  proceeds  by  considering  the  nodes  in  Vu  in  any  given  order.  For  each  node  i .  i  is 
added  to  the  first  set  VUK  which  does  not  contain  a  node  that  is  a  neighbor  to  i .  If  this  con¬ 
dition  is  not  satisfied  for  any  of  the  sets  constructed  so  far.  then,  a  new  set  which  contains 
only  node  i  is  created.  Let  su  be  the  number  of  independent  sets  constructed  for  Vu  .  and 
let  I  Vuk  1 .  1  be  the  number  of  nodes  in  the  set  V„\ 

For  pierced  rectangular  domains  and  FE4  discretizations,  the  number  of  independent 
subsets  is  constant  for  any  level  u  .  Namely  su  =2.  However,  for  general  grids,  su  may  not 
be  constant.  In  this  case,  we  define  rmax  =  max{ju  (.  Note  that,  for  su  <A^jmax.  the  sets  VUK 

u 

are  empty. 


Given  the  independent  sets,  the  partitions  Gl.G2.  ■  may  be  constructed  by  the  com¬ 
bination  of  the  appropriate  independent  sets  such  that  the  variation  in  the  size  of  the 
different  partitions  is  as  small  as  possible.  For  this,  the  following  algorithm  may  be  applied 


ALG8 :  Creation  of  independent  patritions 


For  1=1,  ■■■  ,2  s  max,  initialize  G1  to  the  empty  set. 

For  u  =1,3.5.  -  Do 

For  X=l.  •  •  •  ,jmax  Do 

Chose  the  partition  Gl .  1  ^  l  ^  jmax  which  contains,  so  far.  the  minimum 
number  of  nodes  (break  ties  arbitrarily),  and  set  Gl  =  Gl  (J  VUK. 

For  u  =2,4,6,  -  Do 

For  X=l.  ■  •  •  s max  Do 

Chose  the  partition  Gl .  s  max  <  l  ^  2  Jmax  which  contains,  so  far,  the 
minimum  number  of  nodes  (break  ties  arbitrarily),  and  set 

Gl  =  G‘  U 


After  obtaining  the  2rmax  partitions,  we  may  then  start  the  actual  numbering  of  the 
nodes.  For  this,  we  consider  the  partitions  in  the  order  G  l.  G2 . and  within  each  parti¬ 

tion.  G' ,  we  number  the  nodes  sequentially  such  that  if  V*  and  V,*-  are  two  subsets  in  G' 
with  v  >u  .  then  the  nodes  in  V*  are  numbered  before  the  nodes  in  V/. 


Clearly.  ALG8  guarantees  that  the  sizes  of  the  partitions  G1 ,  l  =1,  •  ,jmax  does  not 

differ  by  more  than  the  size  of  the  largest  subset  among  all  subsets  VUK,  X=1 . xmax. 

u  =1.2 .  The  same  applies  to  the  partitions  G! .  I  =s  max+l .  •  •  •  ,2jmax.  However,  no  rela¬ 

tion  may  be  established  between  the  size  of  two  partitions  G1  and  G\  l  max  and  X>jmax, 
without  an  explicit  assumption  about  the  size  of  each  level  Vu  with  respect  to  the  following 
level  V„+1.  If  an  assumption  of  this  kind  is  made,  then  it  may  be  possible,  following  a  rea¬ 
soning  similar  to  that  of  Theorem  2,  to  find  a  lower  bound  on  the  value  of  the  zero  stretch 
of  the  matrix  resulting  from  this  multicolor  numbering  scheme.  We  will  not  pursue  this 
issue  any  further. 


We  illustrate  our  numbering  scheme  by  applying  it  to  the  grid  of  Figure  5.  In  this 
figure,  a  level  structure  )V/i>  •  ■  ,V9)  is  indicated  by  bold  dashed  lines,  and  within  each 
level  Vu  ,  the  nodes  which  are  assigned  to  the  subsets  V/,  Vu2  and  V,,3  are  marked  by  the 


symbols  •,  □  and  A,  respectively.  The  application  of  ALG8,  then,  results  in  the  following 


partitions: 


Fig  5  -  multicolor  numbering  of  an  irregular  grid 


G1  =  V,1  U  V33  U  V52  |J  V9l 

G2  =  V,2  U  V32  |J  Vi  (J  Vi 

G 3  =  V 31  (J  Vi  U  W  U  ^93 

G*=Vl  (J  Vi  (J  Vi 

G5  =  V|  U  Vi  (J  V2 

G 6  =  v 23  U  v*  U  vi 


which  leads  to  the  numbering  shown.  It  is  easy  to  check  that  the  zero  stretch  resulting  from 
this  numbering  is  A  -  8.  How  good  is  this  result  compared  with  the  maximum  possible  zero 
stretch  for  this  specific  grid?,  and  is  there  any  numbering  scheme  which  is  not  NP  complete, 
and  which  will  produce  the  maximum  zero  stretch?.  We  do  not  know  the  answer  to  these 
questions,  but  some  preliminary  results  lead  us  to  believe  that  the  answer  to  the  second 
question  is  negative. 

Finally,  we  should  note  that  the  above  numbering  scheme  is  global  in  the  sense  that 
all  the  nodes  which  are  given  a  particular  color  are  numbered  consecutively.  A  level-wise 
numbering  scheme  analogous  to  the  column-wise  scheme  of  Section  4.1  may  also  be  applied. 
However,  the  construction  of  the  subsets  Vj.Vu2.  in  this  case  should  be  done  such  that 
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the  size  of  these  subsets  is  as  uniform  as  possible. 

5.  NUMERICAL  RESULTS 


& 

iii 


v.v 
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In  order  to  evaluate  the  diagonal  update  technique  of  Section  3,  and  to  test  the 
column-wise  multicolor  numbering  scheme  of  Section  4,  we  implemented  the  PCCG  method 
on  the  CRAY  X-MP  using  the  stripe  structure  approach  of  Section  2.  The  implementation 
allows  for  the  solution  of  general  symmetric  sparse  systems  and  provides  for  no  precondi¬ 
tioning.  SSOR,  SIC  (Pd  ,  oi)  or  SIC{Pa.u j)  preconditioning.  The  PCCG  iteration  is  stopped 
when  the  2-norm  of  both  the  residual  and  the  step  size  fall  below  10-5. 

The  linear  systems  used  in  our  experiments  are  generated  by  using  a  modified  version 
of  the  finite  element  code  of  [2].  The  modification  incorporates  the  stripe  data  structure  and 
provides  options  for  l)  the  automatic  specification  of  pierced  rectangular  domains  and  grids. 
2)  the  choice  of  the  numbering  scheme  (one  color,  column-wise  multicolor  or  global  mul¬ 
ticolor).  and  3)  the  application  of  finite  difference  discretization. 

In  this  section,  we  report  the  results  obtained  for  the  following  three  problems: 

Problem  1: 

V2«  —  u  =  0  on  D 

u  —  1  +  x  y  on  QD 

where  D  is  the  unit  square.  For  this  problem  we  consider  both  FD 5  and  F£4  discretiza¬ 
tions.  For  FD 5,  we  consider  many  grid  sizes  and  we  report  in  Table  2  the  number  of  itera¬ 
tions  I  and  the  execution  time  T  (in  seconds)  for  the  PCCG  corresponding  to  the  numbering 
schemes  discussed  in  Section  4.  We  also  report  I  and  T  for  the  basic  (non-preconditioned ) 
conjugate  gradient  method.  The  execution  times  are  omitted  for  the  64x64  grid  because 
they  are  relatively  small  and  thus  inconclusive.  Yote  that  the  global  two  color  numbering 
scheme  is  the  usual  red/black  numbering  scheme. 

The  results  in  Table  2  are  obtained  by  fixing  the  shift  factor,  cj,  (over  relaxation  fac¬ 
tor)  to  unity.  In  order  to  observe  the  effect  of  changing  w.  we  report  in  Table  3,  the  results 
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grid 

size 

precond. 

method 

one 

color 

global 

2-colors 

column-wise 

2-colors 

no  prec. 
1-color 

64X64 

SSOR 

IC(Pd) 

70/ 

59/ 

97/ 

97/ 

84/ 

83/ 

191/ 

100x100 

SSOR 
IC(Pd ) 

110/5.38 

90/4.42 

153/2.54 

153/2.57 

131/2.72 

129/2.69 

150X150 

SSOR 

1C  (Pd ) 

163/17.6 

136/14.0 

231/8.6 

231/8.7 

197/7.9 

193/7.8 

200X200 

SSOR 

1C  {Pd ) 

218/41.3 

181/34.5 

308/20.3 

308/20.3 

262/18.4 

258/18.3 

586/13.1 

Table  2  -  The  number  of  PCCG  iterations  and  the  execution  time  (in  seconds) 
for  Problem  1  with  FD  5  discretization  (o>=l). 


grid 

size 

precond. 

method 

one 

color 

global 

2-colors 

column-wise. 

2-colors 

I/T 

I/T 

E  1 

I/T 

Em 

200X200 

SSOR 

SIC  (Pd  ,6J) 

51/10.1 

53/10.2 

1.95 

1.23 

308/20.3 

308/20.3 

1.0 

1.0 

253/17.9 

253/17.9 

m 

Table  3  -  The  number  of  PCCG  iterations  (I)  and  the  execution  time  (T)  for 
Problem  1  with  FD  $  and  the  optimal  choice  of  to. 


precond. 

method 

one 

color 

global 

4-colors 

column-wise 

2-colors 

I/T 

fcj 

I/T 

(0 

I/T 

0) 

SSOR 

120/19.6 

1.0 

144/11.6 

1.0 

134/7.8 

1.0 

1C  (Pd  ) 

112/18.4 

1.0 

143/11.7 

1.0 

127/7.6 

1.0 

I/T 

ESS 

I/T 

EH 

I/T 

SSOR 

37/6.3 

1.95 

144/11.6 

1.0 

94/5.5 

wm 

SIC  (Pd.0j) 

38/6.5 

1.55 

144/11.7 

1.0 

85/5.3 

wm 

I/T  =  273/4.89 

Table  4  -  Results  for  problem  1  with  FEA  on  a  150x150  grid. 


for  the  SSOR/PCCG  and  the  SIC  ( Pd  ,to)/PCCG  methods  with  u)=u)ap,  .  the  optimal  value  of 
to  which  gave  the  fastest  convergence.  In  Table  4,  we  combine  the  results  for  to=l  and 
to=<o0pr  one  table  for  the  FEA  discretization. 

Problem  2 :  (PROB  29  in  [18]) 


V  y  V«  =0  on  D 
u  —  x  —  y  on  QD 


where  D  is  the  unit  square.  The  results  of  FEA.  FE3  and  FE 9  discretizations  for  this  prob¬ 
lem  are  given  in  Tables  5  and  6.  At  this  point,  we  would  like  to  note  that  the  running  times 
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reported  for  this  and  the  previous  problems  seem  high  compared  to  the  running  times  which 
are  reported  in  other  papers  (e.g.  [24.  25]).  This  should  not  be  surprising  because  our 
implementation  is  designed  for  general  sparse  matrices,  and  hence  does  not  assume  any  spe¬ 
cial  form  for  the  stripes.  Clearly,  an  implementation  which  is  designed  specifically  for 
matrices  generated  from  rectangular  grids  would  use  the  fact  that  all  the  stripes  are  parallel 
to  the  diagonal  of  the  matrix,  thus  eliminating  the  need  for  the  time  consuming  gather  and 
scatter  operations. 


precond. 

method 

one 

color 

global 

4-colors 

column-wise 

2-colors 

I/T 

(l) 

I/T 

0) 

I/T 

(i) 

SSOR 

136/22.7 

1.0 

171/13.5 

1.0 

140/8.5 

1.0 

lC{Pd ) 

126/21.1 

1.0 

170/13.5 

1.0 

132/8.1 

1.0 

/CCP„) 

96/17.4 

1.0 

166/16.7 

1.0 

109/8.1 

1.0 

I/T 

EH 

I/T 

EH 

I/T 

EM 

SSOR 

39/6.8 

mm, 

171/13.5 

1.0 

98/5.8 

1.6 

SIC  {Pd  .o>) 

39/6.9 

EH 

170/13.5 

1.0 

86/5.2 

SIC  {Pn  .«) 

37/7.5 

1.2 

166/16.7 

1.0 

71/5.7 

1.2 

no  prec. 

I/T  -  647/12.8 

Table  5  -  Results  for  Problem  2  with  FE 4  on  a  150x150  grid 


precond. 

method 

fe3 

FE 

9 

one 

color 

global 

4-colors 

column-wise 

2 -colors 

one 

color 

column-wise 

3-colors 

I 

u> 

I 

(D 

I 

0) 

I/T 

0) 

1/T 

0) 

SSOR 

108 

1.0 

ESI 

1.0 

ESI 

1.0 

109/19.3 

1.0 

114/9.9 

1.0 

IC(Pd) 

92 

1.0 

ESS 

1.0 

HU 

1.0 

97/17.3 

1.0 

103/9.1 

1.0 

I 

EH 

i 

EH 

i 

ea 

I/T 

wmm 

I/T 

EH 

SSOR 

35 

1.9 

ESI 

■El 

109 

35/6.4 

1.95 

86/7.5 

EB 

SIC  (P,  .t») 

32 

1.3 

EH 

EH 

105 

1.2 

46/8.4 

1.4 

80/7.1 

BH 

IEEV2EH! 

i  - 

565 

I/T -556/12.1 

Table  6  -  Results  for  Problem  2  with  FE3  and  FEq  on  a  99x99  grid. 


Problem  3:  (PROB  26  in  [18]) 

V  (l+x2)3  V  u  =  60  x  on  D 

u  =  0  on  QD 


where  D  is  the  pierced  rectangular  domain  of  Figure  3.  The  result  of  the  FE  d  discretization 
for  this  problem  using  a  100x200  pierced  rectangular  grid  is  given  in  Table  7. 


Table  7  -  Results  for  Problem  3  with  FE4 
By  studying  the  above  results,  we  may  note  the  following: 

1)  For  large  grid  sizes,  the  column-wise  numbering  scheme  is  consistently  better  than  the 
global  numbering  scheme.  In  fact,  it  seems  that  the  global  scheme  may  be  recommended 
(on  the  CRAY)  only  if  the  grid  size  is  relatively  small.  More  specifically,  if  the  zero  stretch 
for  the  column-wise  scheme  is  smaller  than  64.  Clearly,  for  such  small  problems,  the 
power  of  a  supercomputer  is  not  needed. 

2)  If  no  adaptive  procedure  is  incorporated  in  the  PCCG  solver  to  choose  the  optimal  oj.  then 
the  natural  choice  is  co=l.  In  this  case,  it  is  clear  from  the  results  that  the  use  of  the 
IC  ( Pd  )  preconditioner  instead  of  the  SSOR  preconditioner  reduces  the  number  of  PCCG 
iterations  for  both  the  one  color  scheme  and  the  column-wise  coloring  scheme.  The  advan¬ 
tage  of  the  1C  ( Pd  )  preconditioner  is  clear,  given  that  the  cost  of  the  computation  of  the 
IC  ( Pd  )  matrix  is  equal  to  the  cost  of  one  PCCG  iteration. 

3)  The  IC{Pa)  preconditioner  may  reduce  the  number  of  iterations  over  the  ICiPj) 
preconditioner.  However,  the  IC{Pd)/PCCG  runs  usually  faster  than  the  IC{Pa)/PCCG 
because  it  costs  too  much  to  compute  the  1C  (Pa  )  factorization.  Here,  we  would  like  to  note 
that  our  implementation  does  not  take  advantage  of  the  technique  of  Eisenstat  [6],  which  is 
described  at  the  end  of  Section  3.  to  reduce  the  work  in  SSOR  / PCCG  and  IC{Pd  )/PCCG 
Clearly,  the  use  of  this  technique  will  reduce  further  the  execution  time  of  these  methods 
compared  to  the  IC  ( Pa  )/PCCG  . 
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4)  If  a  one-color  numbering  scheme  is  applied,  then  it  is  possible  to  reduce  the  number  of 
iterations  in  the  SSOR/PCCG  drasticly  by  using  the  optimum  value  of  w.  This  drastic 
reduction  offsets  the  disadvantage  of  having  a  unit  zero  stretch.  In  other  words,  if  an  adap¬ 
tive  strategy  may  be  used  to  chose  the  optimum  <o.  then  the  one-color  numbering  scheme 
seems  to  be  the  most  appropriate  scheme  to  apply. 

5)  For  problems  that  are  well  conditioned,  as  for  example  problem  1.  the  fastest  way  to 
solve  the  problem  is  the  non-preconditioned  CG  method.  However,  this  is  not  the  case  for 
problems  2  and  3  which  are  ill  conditioned. 


vjv 


one 

color 

col-wise 

3-colors 

col-wise 

2-colors 

global 

4-colors 

global 

2-colors 

number  of  SSOR  /PCCG  iterations 

163 

187 

197 

204 

231 

number  of  1C  (Pd  )/PCCG  iterations 

136 

165 

193 

199 

231 

the  zero  stretch  A 

1 

50 

74 

5624 

11175 

Table  8  -  The  effect  of  A  on  the  solution  of  Problem  1  with  FD5  discretization 

on  a  150x150  grid  (o>=l). 

6)  There  seem  to  be  an  inverse  relation  between  the  value  of  the  zero  stretch  and  the  con¬ 
vergence  rate  of  the  PCCG.  In  order  to  be  more  specific,  we  consider  the  FDS  discretization 
of  Problem  1.  and  in  addition  to  the  numbering  schemes  reported  in  Table  2.  we  use  a  3- 
color.  column-wise  scheme,  and  a  4-color  global  scheme.  The  results  which  we  report  in 
Table  8  confirm  this  inverse  relation. 

Finally,  we  would  like  to  note  that  only  rectangular  and  pierced  rectangular  domains 
were  chosen  in  the  experiments  because,  first,  we  did  not  have  a  means  for  the  automatic 
generation  of  large  irregular  grids,  and  second,  the  numbering  algorithm  of  Section  4.3  is.  to 
our  knowledge,  the  first  algorithm  which  aims  at  increasing  the  zero  stretch  for  general 
matrices.  Hence,  its  performance  may  only  be  compared  to  an  algorithm  which  produces 
the  maximum  zero  stretch.  We  believe  that  such  an  algorithm  is  NP  complete. 
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APPENDIX 

Proof  of  Theorem  1  :  Let  i  and  j  be  two  nodes  in  some  element  e  in  O,  and  let  l , 
1^1  ^q  .  be  the  color  given  to  i  and  l  +  be  the  color  given  to  j .  where  l  +=l  +1  if  1^1 <q . 
and  1=1  if  l  =q  .  Note  that  either  1)  i  and  j  lie  in  the  same  column  of  nodes.  CNU .  in  ft  or 
2)  i  lies  in  CNU  and  j  lies  in  the  following  column  of  nodes,  namely  CNu+i.  Let  also  C£v 
be  a  column  of  elements  in  ft  which  contains  element  e  ,  and  let  ceu.  be  the  number  of  ele¬ 
ments  in  CEW . 

From  ALG6  and  ALG7,  it  is  clear  that  the  number  of  nodes  which  are  numbered 
between  nodes  i  and  j.  namely  v(  j  )—v(i  )+l,  is  equal  to  the  sum  of  1)  the  number  of 
nodes  with  color  l  which  are  above  node  i  (including  i  )  in  column  CNU .  and  2)  the  number 
of  nodes  with  color  Z+  which  are  below  node  j  (including  j  )  in  column  CNU  or  CNU+1. 
whichever  applies. 

Given  that  at  most  two  elements  in  the  same  column  may  share  a  single  node,  and  that 
for  any  specific  color,  every  element  should  contain  a  node  of  that  color,  we  may  conclude 
that  v(j  )—v(i  )+l  should  be  larger  than,  or  equal  to,  the  sum  of  1)  half  the  number  of  ele¬ 
ments  in  CEW  above,  and  including,  element  e  .  and  2)  half  the  number  of  elements  in  C£„ 
below,  and  including,  element  e  .  In  other  words. 

v(j  )  —  v(i  )  +  1  >  — - — 


But  A  =  minlK/  )— u(i  )).  where  e  may  be  in  any  column  in  SI.  Hence 


v  ce  mjn,4'  1 
A  +  1  25  — - 


which  gives  the  lower  bound. 

In  order  to  establish  the  upper  bound  on  A,  we  note  that  v(  j  )—v(i  )— 1  is  equal  to  the 
sum  of  nodes  of  color  l  above  node  i  and  the  nodes  of  color  Z+  below  node  j  (not  including 
i  and  j  ).  But.  each  element  in  fl  may  contain  at  most  one  node  of  any  given  color.  Then 


and  the  upper  bound  follows  directly.  □ 


Proof  of  Lemma  1:  Let  CNU  be  the  u,h  column  of  nodes  in  A  and  let  cnu  be  the  number  of 

nodes  in  this  column.  In  the  global  multicolor  numbering  scheme,  the  colors  1 . Ip  are. 

first,  used  to  color  the  nodes  in  CN  .  v  =0,1.2 . such  that  any  two  neighboring  nodes 

in  a  column  are  given  different  colors.  Then,  the  colors  'Ip  +1 . lip  are  used  to  color  the 

nodes  in  CN2+vy/j.  v  =0,1,2 and  so  on.  Thus,  if  the  given  color  l  satisfies 

(k-l)Ip  <1  ^  kip  .for  some  k  .  1  ^  Ip  .  then  the  number  of  nodes  in  A  which  are 

given  the  color  l .  namely  n  .  is  obtained  from 


n  n  =  Z  cnl 


•  =0,1.... 


+v  J p 


(14) 


where  cn‘  is  the  number  of  nodes  which  are  given  the  color  l  in  column  CNU . 

In  order  to  estimate  each  term  in  (14).  we  consider  a  particular  column  of  nodes  CNU . 

where  u  =k  +v  Ip  ,  for  some  v  ,  and  we  let  L  =(k  — 1  )Ip  .  Clearly,  the  colors  used  to  color 

the  nodes  in  CNa  are  A=Z.+1 . L+Ip  .  We  also  define  a  clique  in  CNU  as  a  subset  of 

* 

nodes  within  CNU  such  that  any  two  nodes  in  this  subset  are  neighbors.  Note  that  the 
nodes  in  CNU  which  lie  on  a  vertical  side  of  an  element  form  a  clique.  Let  clqu  be  the 
number  of  such  cliques  in  CNU .  Given  that  any  particular  node  in  CNU  may  be  in  at  most 
two  cliques,  we  obtain 

clqu  ^  2cnuK  A=L+1.  •  ,L+Ip  (I5.a) 


Also,  the  coloring  scheme  is  such  that  any  clique  may  contain  at  most  one  node  of  any  par¬ 
ticular  color.  This  gives 

clqu  ^  cnl  k=L  +1.  •  ,L  +Ip  (15. b) 


From  ( 1 5 .a )  and  (15. b).  we  get,  for  the  given  color  l . 
cnl  ^  2  cnuK  X =L  +1 ,  ■  .L  +Ip 

If  we  write  the  above  inequality  Ip  —  1  times  with  X =L  +1.  .L  +  v7-  A#/  .  sum  these 
inequalities,  and  then  add  2 cnj  to  both  sides  of  the  result,  we  get 
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a 


a 


. _  L  +yfp 

(' jp  +1)  cnj  ^  2  52  cnKX  =  2  cnu 

K—L 


Similarly,  from  (I5.a)  and  (15.b)  we  may  obtain 


,  >  1  x 
cnlu  *  —  cnuK 


\—L  +1,  •  •  •  ,L  +'J~p 


which  leads  to 


( 'fp  —  Y )  cnu  ^  j  cnu 

Equation  (16)  and  ( 1 7)  may  now  be  combined  into 


1  is  2 

— = - cnu  ^  cnl  ^  —== - cnu 

I'Jp  -1  vp  +1 


which,  together  with  (14)  gives 


.  f—  1  cnk+*yfJ  ^  nn  ^  /—  4-1  £  cn*+vVjT 

2vp  —1  v  =o.i....  vp  +1  v=0.1.... 

Finally,  we  need  to  relate  the  sum  in  (18)  to  the  total  number  of  nodes  ia  SI.  Instead 
of  dealing  with  complex  formulas,  we  treat  each  of  the  cases  FE4  and  FE 9  separately.  For 
FEa.  any  column  of  nodes.  CNV .  in  Si  is  enclosed  between  two  consecutive  columns  of  ele¬ 
ments.  This  implies  that  cnu  ^  cn„_i+cnu  +1.  Thus 

£  cnu  <  2  £  cn„  (19) 


Noting  that  51  cn„  +  X  cn„  -  n  n.  we  may  use  (19)  to  obtain 

u  ^odd  u  =*vcn 

Lcn«^fno  (20) 

The  same  bounds  (20)  may  be  obtained  for  £  cnu .  which  proves  the  following  for  FE4 


and  p  =4: 


v  .V  •- 


vvv.-,.-  . 


For  FEq  discretizations,  a  similar  procedure,  starting  from  cnu  $cnu+i  +  cnu_2  and 
cnu  +  cnu+2,  may  be  used  to  prove  (21)  for  the  case  p=  9.  The  result  (12).  then, 

follows  directly  from  (21)  and  (18).  □ 

Proof  of  Theorem  h  From  the  definition  of  A,  there  exist  two  node  i  and  j  in  an  element 
e  such  that  v(  j  )— v(i  )=A.  Let  CEU  be  the  column  of  elements  which  contains  e  .  and  let  l 
and  1+1  be  the  colors  of  nodes  £  and  j ,  respectively.  The  proof  proceeds  by  the  construc¬ 
tion  of  two  pierced  rectangular  domains  fij  and  Sl2  which  overlap  at  element  e  .  Namely, 
fl  i  consists  of  element  e  ,  the  elements  in  CNU  above  e  ,  and  the  elements  in  the  following 
columns  CNu+1.CNu+2-  '  '  '•  The  grid  02  consists  of  element  e  .  the  elements  in  CNU  below 
e .  and  the  elements  in  the  previous  columns  CN ■  ■  ■  ,CNU^.  Let  nftj  and  nftj  be  the 
numbers  of  nodes  in  fij  and  fl2.  respectively,  and  note  that  n  fij+n  n  =n  a+p  . 

From  the  global  numbering  scheme,  it  is  clear  that  the  difference  v(  j  )—v(i  )+l  is  equal 
to  the  number  of  nodes  in  flj  which  have  the  color  l .  namely  n'ftj .  plus  the  number  of 
nodes  in  fl2  which  have  the  color  /  +1.  namely  ntf* .  That  is 

v(j  )-*/(»)+  =  n'n,  +  In  ft*1  (23) 

But.  Hi  and  ft2  are  pierced  rectangular  domains,  and  thus,  n'fli  and  n'n+?l  obey  the  bound 
of  Lemma  1.  The  application  of  these  bounds  on  (23)  gives 

1  2  2  2 
— 7= -  I'ln1+*nJ  <  v(j)-v(i')  +  1  ^  = -  [nn  +nnl 

2v p  -1  vp  +1 

Finally.  (13)  results  by  using  the  definition  of  A  and  n  n=n  fi]+n  n  —p  .  □ 
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